Abstract

AUC (area under the ROC curve) is an important evaluation criterion that has been widely used
in diverse learning tasks such as class-imbalance learning, cost-sensitive learning, learning to rank
and information retrieval. Many learning approaches have been developed to optimize AUC;
owing to its non-convexity and discontinuity, almost all of them work with surrogate loss
functions. The study of AUC consistency is therefore crucial, and a previous study suggested
that classification calibration is necessary and sufficient for the consistency of AUC.

In this paper, we show that, for pairwise surrogate losses of AUC, minimizing the expected risk
over the whole distribution is not equivalent to minimizing the conditional risk on each pair
of instances. We disclose that classification calibration is necessary yet insufficient for AUC
consistency, and provide a new sufficient condition for the asymptotic consistency of learning
approaches based on surrogate loss functions. Based on this finding, we prove that exponential
loss, logistic loss and distance-weighted loss are consistent with AUC. Then, we derive the
q-norm hinge loss and general hinge loss that are consistent with AUC. We also derive
consistent bounds for exponential loss and logistic loss, and obtain consistent bounds for many
surrogate loss functions under the non-noisy setting. Furthermore, we disclose an equivalence
between the exponential surrogate loss of AUC and the exponential surrogate loss of accuracy;
one straightforward consequence of this finding is that AdaBoost and RankBoost are equivalent.
1. Introduction

AUC (area under the ROC curve) is an important evaluation criterion which exhibits strong
robustness to changes of the class distribution, and thus can be adopted even when classical
criteria such as accuracy, precision, recall, etc. are inadequate [PFK98, PF01]. It has been
widely used in many learning tasks such as cost-sensitive learning, class-imbalance learning,
learning to rank, information retrieval, etc. [Elk01, FISS03, CM04, BBB+07, AM08, CV08, CVD08,
RS09, KDH11, FHOR11].

Owing to its non-convexity and discontinuity, it is not easy, or even infeasible, to optimize
AUC directly, since such direct optimization often leads to NP-hard problems. Instead, surrogate
loss functions are usually optimized, such as exponential loss [FISS03, RS09] and hinge loss
[BS05, Joa05, ZHJY11]. Minimizing such losses is generally easy, and can be done in polynomial
time. An important question then is how well minimizing such convex surrogate losses leads
to minimizing the actual AUC risk; in other words, does the expected risk of learning with
surrogate loss functions converge to the Bayes risk of AUC? Consistency (also called Bayes
consistency) guarantees that optimizing a surrogate loss ultimately yields an optimal function
with Bayes risk. Thus, the above problem, stated formally, is whether the optimization of
surrogate loss functions is consistent with AUC.



1.1. Our Contribution 

A previous study suggests that classification calibration is necessary and sufficient for the
consistency of AUC, whereas we find that it ignores an important prerequisite, that is, for
pairwise surrogate losses of AUC, minimizing the expected risk over the whole distribution is not
equivalent to minimizing the conditional risk on each pair of instances. We prove that
classification calibration is necessary yet insufficient for AUC consistency, e.g., hinge loss and
absolute loss are classification-calibrated whereas they are inconsistent with AUC. We further
provide a new sufficient condition for the asymptotic consistency of learning approaches based on
surrogate loss functions. Based on this finding, we prove that exponential loss, logistic loss and
distance-weighted loss are consistent with AUC. Then, we derive the q-norm hinge loss and general
hinge loss that are consistent with AUC. We also derive consistent bounds for exponential loss
and logistic loss, and obtain consistent bounds for many surrogate loss functions under the
non-noisy setting. Further, we disclose an equivalence between the exponential surrogate loss of
AUC and the exponential surrogate loss of accuracy; one straightforward consequence of this
finding is that AdaBoost and RankBoost are equivalent.



1.2. Related Work 



The studies on AUC can be traced back to the 1970s in signal detection theory [Ega75], and it has
been widely used as a criterion in the medical area and machine learning [PFK98, PF01, Elk01],
especially for model selection, where AUC exhibits a better measure than accuracy theoretically
and empirically [HL05]. AUC can be estimated under parametric [ZOM02], semi-parametric
[HT96] and non-parametric [HM82] assumptions, and the non-parametric estimation of AUC
is popularly applied in machine learning and data mining, equivalent to the Wilcoxon-Mann-
Whitney (WMW) statistic test of ranks [HM82]. Hand [Han09] and Flach et al. [FHOR11]
gave the incoherent and coherent explanations of AUC as a measure of aggregated classifier
performance, respectively.



AUC has also been studied in learning to rank, especially for bipartite ranking [CSS99, FISS03,
CM04, BS09, Rud09].



Generalization bounds are presented to understand the prediction beyond the training sample
[AGH+05, UAG05, CMR07, CLV08, AN09, RS09]. Also, the learnability of AUC has been studied by
Agarwal and Roth [AR05]. More recently, Kotlowski et al. [KDH11] introduced univariate
surrogate loss functions to optimize bipartite ranking.
Breiman [Bre04] initiated the consistency issue and showed that exponential loss converges to the
Bayes classifier for arcing-style greedy boosting algorithms in the infinite sample case. Bühlmann
and Yu [BY03] studied the consistency of boosting algorithms with respect to least square loss.
The consistency theory for support vector machines is developed in [Lin02, Ste05], and the
influential and fundamental work for binary classification has been investigated comprehensively
by Zhang [Zha04b] and Bartlett et al. [BJM06], in which many famous algorithms (e.g., boosting,
logistic regression and SVMs) are proved to be consistent. Furthermore, the consistency theory on
multi-class classification has been addressed in [Zha04a, TB07], and many SVM-style algorithms
are proved to be inconsistent. More recently, the consistency of multi-label learning has been
studied by Gao and Zhou [GZ11]. Much attention has also been paid to the consistency analysis
of learning to rank [CZ08, XLW+08, XLL09, DMJ10].



In contrast to previous studies on consistency [Zha04a, Zha04b, BJM06, TB07, GZ11] that focused
on single instances, our work concerns the surrogate loss functions of AUC, which are defined on
a pair of instances from different classes. This crucial difference means that, whereas previous
studies could focus on the conditional risk alone, our study on AUC consistency
has to consider the whole distribution, because, as will be shown in Lemma 1, minimizing the
expected risk over the whole distribution is not equivalent to minimizing the conditional risk.
This is a challenge for the study on AUC consistency. A previous study [CLV08, Section 7]
suggested to analyze the AUC consistency by directly extending the results of Bartlett et al.
[BJM06], i.e., classification calibration is necessary and sufficient for AUC consistency. Our study
shows that classification calibration is necessary yet insufficient for AUC consistency. Kotlowski
et al. [KDH11] and Agarwal [Aga12] studied AUC via minimizing univariate losses, which is
different from the pairwise surrogate losses of our concern.



Duchi et al. [DMJ10] studied the consistency of supervised ranking, but it is quite different from
our work. Firstly, the problem settings are different: they considered "instances" consisting of
a query, a set of inputs and a weighted graph, and the goal is to order the inputs according to
the weighted graph; yet we consider instances with positive or negative labels, and the goal is
to rank positive instances higher than negative ones. Moreover, they established inconsistency
for the logistic loss, exponential loss and hinge loss even in the low-noise setting, yet our work
shows that the logistic loss and exponential loss are consistent but hinge loss is inconsistent.



Rudin and Schapire [RS09] established the equivalence between AdaBoost and RankBoost in the
asymptotic behavior (iteration number converges to infinity) for finite training sample when the
negative and positive classes contribute equally. In Section 5, we derive an equivalence between
the exponential surrogate loss of AUC and the exponential surrogate loss of accuracy when the
size of the training sample approaches infinity; this provides a new explanation of the asymptotic
equivalence between AdaBoost and RankBoost.
1.3. Organization

Section 2 introduces some preliminaries and previous studies on AUC consistency. Section 3
shows that classification calibration is necessary yet insufficient for AUC consistency, and
presents a new sufficient condition. Section 4 studies consistent bounds. Section 5 discloses the
equivalence between the exponential surrogate losses of AUC and accuracy. Section 6 presents
detailed proofs. Finally, Section 7 concludes and raises some open problems.

2. Preliminaries 

Let $\mathcal{X}$ denote an instance space and $\mathcal{Y} = \{+1, -1\}$ the label set. We denote by
$\mathcal{D}$ an unknown (underlying) distribution over $\mathcal{X} \times \mathcal{Y}$, and
$\mathcal{D}_{\mathcal{X}}$ represents the instance-marginal distribution over $\mathcal{X}$. For
convenience, the conditional probability $\eta\colon \mathcal{X} \to [0, 1]$ is defined as

$$\eta(x) = \Pr[y = +1 \mid x].$$

We consider a training sample of $n_1$ positive instances and $n_2$ negative instances

$$S = \{(x_1, +1), \ldots, (x_{n_1}, +1), (x'_1, -1), \ldots, (x'_{n_2}, -1)\}$$

drawn identically and independently according to distribution $\mathcal{D}$. Let
$f\colon \mathcal{X} \to \mathbb{R}$ be a score function. Then, the AUC with respect to sample $S$
and function $f$ is defined as

$$\mathrm{AUC}(f, S) = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \Big( I[f(x_i) > f(x'_j)] + \frac{1}{2} I[f(x_i) = f(x'_j)] \Big),$$

where $I[\cdot]$ is the indicator function which returns 1 if the argument is true and 0 otherwise.

Optimizing the AUC is equivalent to minimizing the empirical risk

$$R(f, S) = \frac{1}{n_1 n_2} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} \ell(f, x_i, x'_j),$$

where the loss function $\ell(f, x_i, x'_j) = I[f(x_i) < f(x'_j)] + I[f(x_i) = f(x'_j)]/2$ is also called
ranking loss. It is easy to get

$$\mathrm{AUC}(f, S) + R(f, S) = 1.$$
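The identity above can be checked on a toy sample. The following is a minimal sketch, not part of the original analysis: the function names `auc` and `ranking_risk` and the score values are illustrative assumptions, and the scores stand for $f(x_i)$ and $f(x'_j)$.

```python
def auc(pos_scores, neg_scores):
    """Empirical AUC: correctly ordered (positive, negative) pairs count 1, ties count 1/2."""
    total = 0.0
    for s in pos_scores:
        for t in neg_scores:
            if s > t:
                total += 1.0
            elif s == t:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


def ranking_risk(pos_scores, neg_scores):
    """Empirical ranking risk R(f, S): misordered pairs count 1, ties count 1/2."""
    total = 0.0
    for s in pos_scores:
        for t in neg_scores:
            if s < t:
                total += 1.0
            elif s == t:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for 3 positive and 4 negative instances.
pos = [0.9, 0.4, 0.7]
neg = [0.1, 0.4, 0.8, 0.3]
print(auc(pos, neg), ranking_risk(pos, neg), auc(pos, neg) + ranking_risk(pos, neg))
```

Every pair contributes exactly 1 to the sum of the two quantities (1 + 0, 0 + 1, or 1/2 + 1/2), which is why the sum equals 1 for any score function.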

We define the expected risk of function $f$ as $R(f) = E_S[R(f, S)]$, which is equivalent to

$$R(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\big[\eta(x)(1 - \eta(x'))\,\ell(f, x, x') + \eta(x')(1 - \eta(x))\,\ell(f, x', x)\big]. \tag{1}$$

Denote by $R^*$ the Bayes risk

$$R^* = \inf_f [R(f)],$$

where the infimum is taken over all measurable functions. By simple calculation, we can get the
set of optimal functions, also called the set of Bayes predictors:

$$\mathcal{B} = \{f\colon R(f) = R^*\} = \{f\colon (f(x) - f(x'))(\eta(x) - \eta(x')) > 0 \text{ if } \eta(x) \neq \eta(x')\}. \tag{2}$$



Notice that the ranking loss $\ell$ is non-convex and discontinuous, and a direct optimization often
leads to NP-hard problems. In practice, surrogate loss functions that can be optimized with
efficient algorithms are usually adopted. Throughout this paper, we consider the following
formulation of pairwise surrogate loss functions:

$$\Psi(f, x, x') = \phi(f(x) - f(x')),$$

where $\phi$ is a convex function, e.g., exponential loss $\phi(t) = e^{-t}$ [FISS03, RS09], hinge loss
$\phi(t) = \max(0, 1 - t)$ [BS05, Joa05, ZHJY11], etc. Similarly, we define the expected $\phi$-risk as

$$R_\phi(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\big[\eta(x)(1 - \eta(x'))\,\phi(f(x) - f(x')) + \eta(x')(1 - \eta(x))\,\phi(f(x') - f(x))\big], \tag{3}$$

and denote by $R_\phi^*$ the optimal expected $\phi$-risk,

$$R_\phi^* = \inf_f R_\phi(f),$$

where the infimum is taken over all measurable functions.

Many notions of consistency have been introduced in the literature, e.g., the Fisher consistency
[Lin02], infinite-sample consistency [Zha04a], classification calibration [BJM06, TB07],
edge-consistency [DMJ10], multi-label consistency [GZ11], etc. In this paper, we define formally
the AUC consistency as follows:

Definition 1 The surrogate loss $\phi$ is said to be consistent with AUC if for every sequence
$\{f^{\langle n \rangle}(x)\}_{n \ge 1}$, the following holds over all distributions $\mathcal{D}$ on
$\mathcal{X} \times \mathcal{Y}$:

$$\text{if } R_\phi(f^{\langle n \rangle}) \to R_\phi^* \text{ then } R(f^{\langle n \rangle}) \to R^*.$$
For two given instances $x$ and $x'$, we define the conditional $\phi$-risk as

$$C(x, x', \alpha) = \eta(x)(1 - \eta(x'))\,\phi(\alpha) + \eta(x')(1 - \eta(x))\,\phi(-\alpha), \tag{4}$$

where $\alpha = f(x) - f(x')$, and we have

$$R_\phi(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\big[C(x, x', f(x) - f(x'))\big].$$

For convenience, we denote by $\eta = \eta(x)$ and $\eta' = \eta(x')$. Then, we define the optimal
conditional $\phi$-risk

$$H(\eta, \eta') = \inf_{\alpha \in \mathbb{R}} C(x, x', \alpha) = \inf_{\alpha \in \mathbb{R}} \{\eta(1 - \eta')\,\phi(\alpha) + \eta'(1 - \eta)\,\phi(-\alpha)\},$$

and further define

$$H^-(\eta, \eta') = \inf_{\alpha\colon \alpha(\eta - \eta') \le 0} \{\eta(1 - \eta')\,\phi(\alpha) + \eta'(1 - \eta)\,\phi(-\alpha)\}.$$

Motivated by the work of Bartlett et al. [BJM06], we define the classification calibration of AUC
as follows:

Definition 2 The surrogate loss $\phi$ is said to be classification-calibrated if

$$H^-(\eta, \eta') > H(\eta, \eta') \quad \text{for any } \eta \neq \eta'.$$



Clémençon et al. [CLV08, pp. 846] suggested to study the consistency of AUC through a direct
extension of the results of Bartlett et al. [BJM06, Theorem 3]; in other words, the following
theorem would hold for AUC consistency from [BJM06].

Theorem 1 [BJM06, Theorems 1 and 2] For convex surrogate loss $\phi$, the following are equivalent:

• $\phi$ is classification-calibrated.

• $\phi$ is differentiable at $t = 0$ and $\phi'(0) < 0$.

• $\phi$ is consistent with AUC.

It seems that classification calibration completely characterizes the consistency of learning
algorithms based on convex surrogate loss, and such results are exactly parallel to those of
classification. However, our Lemma 1 shows that this study ignores an important prerequisite:
minimizing the expected $\phi$-risk $R_\phi(f)$ over the whole distribution is not equivalent to
minimizing the conditional $\phi$-risk $C(x, x', \alpha)$ on each pair of instances. Therefore, it is
not correct to directly use classification calibration to study the consistency of AUC; as a matter
of fact, classification calibration is proven to be a necessary yet insufficient condition for AUC
consistency in the next section.

3. AUC Consistency 

Recall that

$$R_\phi^* = \inf_f R_\phi(f) = \inf_f E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\big[C(\eta(x), \eta(x'), f(x) - f(x'))\big],$$

and it is easy to get that

$$R_\phi^* = \inf_f R_\phi(f) \ge E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big]. \tag{5}$$

It is noteworthy that the equality in Eqn. (5) does not hold for some surrogate losses, as
shown by the following lemma:

Lemma 1 For hinge loss $\phi(t) = \max(0, 1 - t)$, it holds that

$$\inf_f R_\phi(f) > E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big].$$

Proof: We prove by contradiction. Suppose that there exists a function $f$ such that

$$R_\phi(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big].$$

For simplicity, we consider three different instances $x_1, x_2, x_3 \in \mathcal{X}$ such that

$$\eta(x_1) < \eta(x_2) < \eta(x_3).$$

The conditional risk of hinge loss is given by

$$C(x, x', \alpha) = \eta(x)(1 - \eta(x')) \max(0, 1 - \alpha) + \eta(x')(1 - \eta(x)) \max(0, 1 + \alpha),$$

and minimizing $C(x, x', \alpha)$ gives $\alpha = -1$ if $\eta(x) < \eta(x')$. From the assumption that

$$R_\phi(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta, \eta', \alpha)\Big],$$

we have $f(x_1) - f(x_2) = -1$, $f(x_1) - f(x_3) = -1$ and $f(x_2) - f(x_3) = -1$. These equalities
contradict each other, since the first and the third imply $f(x_1) - f(x_3) = -2$. Hence the lemma
holds. □
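The argument of Lemma 1 can be illustrated numerically. The sketch below is only an illustration under assumed values: the conditional probabilities 0.2, 0.5 and 0.8 and the helper names are hypothetical, and a grid search stands in for the exact minimization of the conditional hinge risk.

```python
def hinge_conditional_risk(eta, eta_p, alpha):
    """C(x, x', alpha) for phi(t) = max(0, 1 - t), with eta = eta(x), eta_p = eta(x')."""
    return (eta * (1 - eta_p) * max(0.0, 1 - alpha)
            + eta_p * (1 - eta) * max(0.0, 1 + alpha))


def argmin_alpha(eta, eta_p, lo=-3.0, hi=3.0, steps=601):
    """Grid search (step 0.01) for the alpha minimizing the conditional risk."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=lambda a: hinge_conditional_risk(eta, eta_p, a))


# Hypothetical conditional probabilities with eta(x1) < eta(x2) < eta(x3).
eta1, eta2, eta3 = 0.2, 0.5, 0.8
d12 = argmin_alpha(eta1, eta2)  # required f(x1) - f(x2)
d13 = argmin_alpha(eta1, eta3)  # required f(x1) - f(x3)
d23 = argmin_alpha(eta2, eta3)  # required f(x2) - f(x3)

# Any score function satisfies (f1 - f2) + (f2 - f3) = f1 - f3,
# but the three pairwise minimizers do not.
print(d12, d13, d23, "compatible:", abs(d12 + d23 - d13) < 1e-9)
```

Each pairwise minimizer equals -1, so no single score function can realize all three at once.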

In a similar manner, we can prove that the following strict inequality holds for least square loss
$\phi(t) = (1 - t)^2$, absolute loss $\phi(t) = |1 - t|$, least square hinge loss
$\phi(t) = (\max(0, 1 - t))^2$, etc.,

$$\inf_f R_\phi(f) > E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big].$$

That is, minimizing the expected $\phi$-risk $R_\phi(f)$ over the whole distribution is not equivalent
to minimizing the conditional $\phi$-risk $C(x, x', \alpha)$ on each pair of instances. Therefore,
Lemma 1 discloses that, for AUC consistency, we should focus on the expected $\phi$-risk over the
whole distribution rather than the conditional $\phi$-risk on each pair of instances. Classification
calibration, however, is heavily based on the conditional $\phi$-risk, and ignores the expected
$\phi$-risk over the whole distribution; therefore, it is not correct to directly use classification
calibration to study AUC consistency.

3.1. Classification Calibration is Necessary yet Insufficient for AUC Consistency 

We first prove that hinge loss $\phi(t) = \max(0, 1 - t)$ is inconsistent with AUC by the
following theorem:

Theorem 2 For hinge loss $\phi(t) = \max(0, 1 - t)$, the surrogate loss
$\Psi(f, x, x') = \phi(f(x) - f(x'))$ is inconsistent with AUC.

Proof: For simplicity, we consider three distinct instances $x_1, x_2, x_3$, i.e.,
$\mathcal{X} = \{x_1, x_2, x_3\}$, with marginal probability $\Pr[x_i] = 1/3$. Further, we set
$f_i = f(x_i)$ and conditional probability $\eta_i = \eta(x_i)$ such that

$$\eta_1 < \eta_2 < \eta_3, \quad 2\eta_2 < \eta_1 + \eta_3, \quad \text{and} \quad 2\eta_1 > \eta_2 + \eta_1\eta_3.$$

From Eqn. (3), we have

$$\begin{aligned}
R_\phi(f) = C_0 &+ C_1\{\eta_1(1 - \eta_2)\max(0, 1 + f_2 - f_1) + \eta_2(1 - \eta_1)\max(0, 1 + f_1 - f_2)\} \\
&+ C_1\{\eta_1(1 - \eta_3)\max(0, 1 + f_3 - f_1) + \eta_3(1 - \eta_1)\max(0, 1 + f_1 - f_3)\} \\
&+ C_1\{\eta_2(1 - \eta_3)\max(0, 1 + f_3 - f_2) + \eta_3(1 - \eta_2)\max(0, 1 + f_2 - f_3)\},
\end{aligned}$$

where $C_0 = 2(\eta_1 + \eta_2 + \eta_3 - \eta_1^2 - \eta_2^2 - \eta_3^2)/9$ and $C_1 = 2/9$.
Minimizing $R_\phi(f)$ gives

$$R_\phi^* = C_0 + C_1(3\eta_1 + 3\eta_2 - 2\eta_1\eta_2 - 2\eta_1\eta_3 - 2\eta_2\eta_3)$$

when $f^* = (f_1^*, f_2^*, f_3^*)$ s.t. $f_1^* = f_2^* = f_3^* - 1$. Notice that the optimal solution
should not be $f' = (f'_1, f'_2, f'_3)$ s.t. $f'_1 + 1 = f'_2 = f'_3 - 1$, because

$$\begin{aligned}
R_\phi(f') &= C_0 + C_1(2\eta_1(1 - \eta_2) + 3\eta_1(1 - \eta_3) + 2\eta_2(1 - \eta_3)) \\
&= C_0 + C_1(5\eta_1 + 2\eta_2 - 2\eta_1\eta_2 - 3\eta_1\eta_3 - 2\eta_2\eta_3) \\
&= R_\phi^* + C_1(2\eta_1 - \eta_2 - \eta_1\eta_3) > R_\phi^*,
\end{aligned}$$

where we use the condition $2\eta_1 > \eta_2 + \eta_1\eta_3$.

We now construct a sequence $\{f^{\langle n \rangle}\}_{n \ge 1}$ by choosing
$f^{\langle 1 \rangle}(x_1) = f^{\langle 1 \rangle}(x_2) = f^{\langle 1 \rangle}(x_3) - 1$ and
$f^{\langle n \rangle}(x) = f^{\langle 1 \rangle}(x)$ for $n > 1$. Then, it holds that

$$R_\phi(f^{\langle n \rangle}) = R_\phi^* \quad \text{yet} \quad R(f^{\langle n \rangle}) - R^* = C_1(\eta_2 - \eta_1)/2 \quad \text{for } n \ge 1.$$

Therefore, there exists a sequence $\{f^{\langle n \rangle}\}_{n \ge 1}$ such that

$$R_\phi(f^{\langle n \rangle}) \to R_\phi^* \quad \text{yet} \quad R(f^{\langle n \rangle}) \nrightarrow R^*,$$

which completes the proof. □
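The construction in this proof can be reproduced numerically. The following sketch is only a sanity check under assumed values: $\eta = (0.5, 0.55, 0.7)$ is a hypothetical choice satisfying the three conditions, the helper names are illustrative, and a grid search over score differences replaces the exact minimization. The quantity computed by `pair_sum` is the $f$-dependent part of the risk, i.e. the risk minus $C_0$, divided by $C_1$.

```python
from itertools import product

# Hypothetical values satisfying the conditions used in the proof.
ETA = (0.5, 0.55, 0.7)
assert ETA[0] < ETA[1] < ETA[2]
assert 2 * ETA[1] < ETA[0] + ETA[2]
assert 2 * ETA[0] > ETA[1] + ETA[0] * ETA[2]


def pair_sum(f, loss):
    """Sum over i < j of eta_i(1-eta_j) loss(f_i, f_j) + eta_j(1-eta_i) loss(f_j, f_i)."""
    total = 0.0
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        total += ETA[i] * (1 - ETA[j]) * loss(f[i], f[j])
        total += ETA[j] * (1 - ETA[i]) * loss(f[j], f[i])
    return total


def hinge(fx, fxp):
    """Pairwise hinge surrogate phi(f(x) - f(x'))."""
    return max(0.0, 1.0 - (fx - fxp))


def rank_loss(fx, fxp):
    """Ranking loss I[f(x) < f(x')] + I[f(x) = f(x')]/2."""
    return 1.0 if fx < fxp else (0.5 if fx == fxp else 0.0)


# Only score differences matter, so fix f(x1) = 0 and search f(x2), f(x3) on a grid.
grid = [-3.0 + 0.25 * k for k in range(25)]
f_hinge = min(((0.0, u, v) for u, v in product(grid, grid)),
              key=lambda f: pair_sum(f, hinge))

bayes = (0.0, 1.0, 2.0)  # any f ordering the instances by eta is Bayes-optimal for AUC
excess = pair_sum(f_hinge, rank_loss) - pair_sum(bayes, rank_loss)
print(f_hinge, excess)
```

The hinge minimizer ties $x_1$ and $x_2$, and the resulting excess ranking risk (in units of $C_1$) matches $(\eta_2 - \eta_1)/2$ from the proof.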

Another relevant loss, the absolute loss $\phi(t) = |1 - t|$, is also proven to be inconsistent with
AUC as follows:

Theorem 3 For absolute loss $\phi(t) = |1 - t|$, the surrogate loss
$\Psi(f, x, x') = \phi(f(x) - f(x'))$ is inconsistent with AUC.

Proof: Similarly to the proof of Theorem 2, we consider $\mathcal{X} = \{x_1, x_2, x_3\}$ with
marginal probability $\Pr[x_i] = 1/3$, and set $f_i = f(x_i)$ and conditional probability
$\eta_i = \eta(x_i)$ such that

$$\eta_1 < \eta_2 < \eta_3 \quad \text{and} \quad 2\eta_2 > \eta_1 + \eta_3.$$

From Eqn. (3), we have

$$\begin{aligned}
R_\phi(f) = C_0 + C_1\{&\eta_1(1 - \eta_2)|1 + f_2 - f_1| + \eta_2(1 - \eta_1)|1 + f_1 - f_2| + \eta_1(1 - \eta_3)|1 + f_3 - f_1| \\
&+ \eta_3(1 - \eta_1)|1 + f_1 - f_3| + \eta_2(1 - \eta_3)|1 + f_3 - f_2| + \eta_3(1 - \eta_2)|1 + f_2 - f_3|\},
\end{aligned}$$

where $C_0 > 0$ and $C_1 > 0$ are independent of $f$. Minimizing $R_\phi(f)$ gives

$$R_\phi^* = C_0 + C_1(4\eta_1 + \eta_2 + \eta_3 - 2\eta_1\eta_2 - 2\eta_1\eta_3 - 2\eta_2\eta_3)$$

when $f^* = (f_1^*, f_2^*, f_3^*)$ s.t. $f_1^* = f_2^* - 1 = f_3^* - 1$. Notice that the optimal solution
should not be $f' = (f'_1, f'_2, f'_3)$ s.t. $f'_1 + 1 = f'_2 = f'_3 - 1$, because

$$\begin{aligned}
R_\phi(f') &= C_0 + C_1(2\eta_1(1 - \eta_2) + 3\eta_1(1 - \eta_3) + \eta_3(1 - \eta_1) + 2\eta_2(1 - \eta_3)) \\
&= C_0 + C_1(5\eta_1 + 2\eta_2 + \eta_3 - 2\eta_1\eta_2 - 4\eta_1\eta_3 - 2\eta_2\eta_3) \\
&= R_\phi^* + C_1(\eta_1 + \eta_2 - 2\eta_1\eta_3) > R_\phi^*,
\end{aligned}$$

where we use $\eta_1 + \eta_2 - 2\eta_1\eta_3 > \eta_2 - \eta_1\eta_3 > (\eta_1 + \eta_3)/2 - \eta_1\eta_3 > 0$.

We can construct a sequence $\{f^{\langle n \rangle}\}_{n \ge 1}$ by choosing
$f^{\langle 1 \rangle}(x_1) = f^{\langle 1 \rangle}(x_2) - 1 = f^{\langle 1 \rangle}(x_3) - 1$ and
$f^{\langle n \rangle}(x) = f^{\langle 1 \rangle}(x)$ for $n > 1$. Then, it holds that

$$R_\phi(f^{\langle n \rangle}) = R_\phi^* \quad \text{yet} \quad R(f^{\langle n \rangle}) - R^* = C_1(\eta_3 - \eta_2)/2 \quad \text{for } n \ge 1.$$

Therefore, there exists a sequence $\{f^{\langle n \rangle}\}_{n \ge 1}$ such that

$$R_\phi(f^{\langle n \rangle}) \to R_\phi^* \quad \text{yet} \quad R(f^{\langle n \rangle}) \nrightarrow R^*,$$

which completes the proof. □
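The same kind of numerical check applies to the absolute loss. In the sketch below, $\eta = (0.2, 0.6, 0.7)$ is a hypothetical choice satisfying $\eta_1 < \eta_2 < \eta_3$ and $2\eta_2 > \eta_1 + \eta_3$; the helper names are illustrative, the losses are written as functions of $t = f(x) - f(x')$, and constants $C_0$ and $C_1$ are dropped because they do not affect the minimizer.

```python
from itertools import product

ETA = (0.2, 0.6, 0.7)  # hypothetical conditional probabilities
PAIRS = [(0, 1), (0, 2), (1, 2)]


def weighted(f, loss):
    """(risk - C_0) / C_1: pairwise terms weighted by eta_i(1-eta_j) and eta_j(1-eta_i)."""
    return sum(ETA[i] * (1 - ETA[j]) * loss(f[i] - f[j])
               + ETA[j] * (1 - ETA[i]) * loss(f[j] - f[i])
               for i, j in PAIRS)


def absolute(t):
    """Absolute loss phi(t) = |1 - t|."""
    return abs(1.0 - t)


def rank(t):
    """Ranking loss as a function of t = f(x) - f(x')."""
    return 1.0 if t < 0 else (0.5 if t == 0 else 0.0)


# Fix f(x1) = 0 and search f(x2), f(x3) on a grid with step 0.25.
grid = [-3.0 + 0.25 * k for k in range(25)]
f_abs = min(((0.0, u, v) for u, v in product(grid, grid)),
            key=lambda f: weighted(f, absolute))
excess = weighted(f_abs, rank) - weighted((0.0, 1.0, 2.0), rank)
print(f_abs, excess)
```

Here the surrogate minimizer ties $x_2$ and $x_3$, and the excess ranking risk (in units of $C_1$) agrees with $(\eta_3 - \eta_2)/2$.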

It is noteworthy that hinge loss $\phi(t) = \max(0, 1 - t)$ and absolute loss $\phi(t) = |1 - t|$ are
convex with $\phi'(0) = -1 < 0$, and they are classification-calibrated, whereas Theorems 2 and 3
show their inconsistency with AUC, respectively. Therefore, classification calibration is no longer
a sufficient condition for AUC consistency.

Corollary 1 Classification calibration is not sufficient for AUC consistency. For convex function
$\phi$, the condition that $\phi(t)$ is differentiable at $t = 0$ with $\phi'(0) < 0$ is not enough for
AUC consistency.

Though classification calibration is not sufficient for AUC consistency, it can be proven to be a
necessary condition as follows:

Lemma 2 If the surrogate loss $\phi$ is consistent with AUC, then $\phi$ is classification-calibrated,
and for convex $\phi$, it is differentiable at $t = 0$ with $\phi'(0) < 0$.

The detailed proof is presented in Section 6.1. Based on Corollary 1 and Lemma 2, we derive our
first main result:

Theorem 4 Classification calibration is necessary yet insufficient for AUC consistency.

This theorem shows that the study of AUC consistency is not parallel to that of classification,
where classification calibration is necessary and sufficient for the consistency of 0/1 loss
[BJM06]. In contrast to Clémençon et al. [CLV08], where hinge loss and absolute loss are regarded
as consistent with AUC, our results disclose their inconsistency.



3.2. Sufficient Condition for AUC Consistency 

In the previous section, we have shown that classification calibration is no longer sufficient for
AUC consistency, and therefore, it is necessary to suggest a new sufficient condition. Meanwhile,
by Lemma 2, this new sufficient condition must imply classification calibration. We now
present a new sufficient condition as follows:

Theorem 5 The surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC if
$\phi\colon \mathbb{R} \to \mathbb{R}$ is a convex, differentiable and non-increasing function with
$\phi'(0) < 0$.

The detailed proof is deferred to Section 6.2. Based on this theorem, it is easy to get:

Corollary 2 For exponential loss $\phi(t) = e^{-t}$, the surrogate loss
$\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

Corollary 3 For logistic loss $\phi(t) = \ln(1 + e^{-t})$, the surrogate loss
$\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

Marron et al. [MTA07] introduced the distance-weighted discrimination method to deal with
problems with high dimension yet small-size sample, and this method has been reformulated by
Bartlett et al. [BJM06], for any $\epsilon > 0$, as follows:

$$\phi(t) = \begin{cases} \dfrac{1}{t} & \text{for } t \ge \epsilon, \\[2mm] \dfrac{1}{\epsilon}\Big(2 - \dfrac{t}{\epsilon}\Big) & \text{otherwise.} \end{cases} \tag{6}$$

Based on Theorem 5, we can also derive its consistency as follows:

Corollary 4 For distance-weighted loss $\phi$ given by Eqn. (6) with $\epsilon > 0$, the surrogate loss
$\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.



It is noteworthy that the hinge loss $\phi(t) = \max(0, 1 - t)$ is not differentiable at $t = 1$, and
we cannot apply Theorem 5 directly to study the consistency of hinge loss. Theorem 2 proves its
inconsistency and also shows the difficulty for consistency without differentiability, even if the
surrogate loss function $\phi$ is convex and non-increasing with $\phi'(0) < 0$. We now derive some
variants of hinge loss that are consistent. For example, the q-norm hinge loss:

$$\phi(t) = (\max(0, 1 - t))^q \quad \text{for some } q > 1.$$

Based on Theorem 5, we can get the AUC consistency of the q-norm hinge loss:

Corollary 5 For q-norm hinge loss $\phi(t) = (\max(0, 1 - t))^q$ with $q > 1$, the surrogate loss
$\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

From this corollary, it is immediate to get the consistency of the least-square hinge loss
$\phi(t) = (\max(0, 1 - t))^2$. We further define the general hinge loss, for any $\epsilon > 0$, as:

$$\phi(t) = \begin{cases} 1 - t & \text{for } t \le 1 - \epsilon, \\ (t - 1 - \epsilon)^2 / (4\epsilon) & \text{for } 1 - \epsilon < t < 1 + \epsilon, \\ 0 & \text{otherwise.} \end{cases} \tag{7}$$

It is easy to obtain the AUC consistency of general hinge loss from Theorem 5:

Corollary 6 For general hinge loss $\phi$ given by Eqn. (7) with $\epsilon > 0$, the surrogate loss
$\Psi(f, x, x') = \phi(f(x) - f(x'))$ is consistent with AUC.

Hinge loss is inconsistent with AUC, but we can use a consistent surrogate loss, e.g., the general
hinge loss, to approach hinge loss when $\epsilon \to 0$. In addition, it is also interesting to derive
other surrogate loss functions that are consistent with AUC under the guidance of Theorem 5.


4. Consistent Bounds 

4.1. Consistent Bounds for Exponential Loss and Logistic Loss

Corollaries 2 and 3 show that the exponential loss and logistic loss are consistent with AUC,
respectively. In this section, we further derive their consistent bounds. The exponential loss and
logistic loss possess a special property:

Lemma 3 For exponential loss and logistic loss, it holds that

$$\inf_f R_\phi(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big].$$

Proof: We provide the detailed proof for the exponential loss, and a similar proof can be obtained
for the logistic loss. Fixing an instance $x_0 \in \mathcal{X}$ and $f(x_0)$, we set

$$f(x) = f(x_0) + \frac{1}{2} \ln \frac{\eta(x)(1 - \eta(x_0))}{\eta(x_0)(1 - \eta(x))} \quad \text{for } x \neq x_0.$$

It remains to prove $R_\phi(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}[\inf_\alpha C(\eta(x), \eta(x'), \alpha)]$.
Based on the above equation, we have, for instances $x_1, x_2 \in \mathcal{X}$:

$$f(x_1) - f(x_2) = \frac{1}{2} \ln \frac{\eta(x_1)(1 - \eta(x_2))}{\eta(x_2)(1 - \eta(x_1))},$$

which exactly minimizes $C(\eta(x_1), \eta(x_2), \alpha)$ when $\alpha = f(x_1) - f(x_2)$, and therefore
the lemma holds as desired. □
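The key step of this proof, namely that a single score function realizes every pairwise minimizer simultaneously, can be checked numerically for the exponential loss. The sketch below uses hypothetical conditional probabilities and a hypothetical reference point with $\eta(x_0) = 0.5$ and $f(x_0) = 0$; the function names are illustrative.

```python
import math
from itertools import permutations


def exp_conditional_risk(eta, eta_p, alpha):
    """C(eta, eta', alpha) for phi(t) = exp(-t)."""
    return eta * (1 - eta_p) * math.exp(-alpha) + eta_p * (1 - eta) * math.exp(alpha)


def pair_minimizer(eta, eta_p):
    """Closed-form minimizer of the conditional risk over alpha."""
    return 0.5 * math.log(eta * (1 - eta_p) / (eta_p * (1 - eta)))


def score(eta, eta0=0.5, f0=0.0):
    """f(x) from the proof, with assumed reference values eta(x0) = eta0 and f(x0) = f0."""
    return f0 + 0.5 * math.log(eta * (1 - eta0) / (eta0 * (1 - eta)))


etas = [0.2, 0.35, 0.65, 0.8]  # hypothetical conditional probabilities in (0, 1)
worst_gap = max(abs(score(a) - score(b) - pair_minimizer(a, b))
                for a, b in permutations(etas, 2))
print(worst_gap)
```

Unlike the hinge case in Lemma 1, the pairwise minimizers here are differences of one function of $\eta$, so they are mutually compatible.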

It is noteworthy that Lemma 3 is constrained to the exponential loss and logistic loss, and it does
not hold for other surrogate loss functions such as hinge loss, general hinge loss, q-norm hinge
loss, etc. For the exponential loss and logistic loss, Lemma 3 shows that minimizing the expected
risk over the whole distribution is equivalent to minimizing the pairwise-instance conditional
risk. Based on this property, we can obtain the consistent bounds for the exponential loss and
logistic loss by focusing on their conditional risks. For a general theory, we consider the following
equivalence, which holds for the exponential loss and logistic loss:

$$\inf_f R_\phi(f) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}\Big[\inf_\alpha C(\eta(x), \eta(x'), \alpha)\Big],$$

and we denote by $f^*$ the optimal functions, i.e.,
$R_\phi(f^*) = E_{x, x' \sim \mathcal{D}_{\mathcal{X}}^2}[\inf_\alpha C(\eta(x), \eta(x'), \alpha)]$. Under
the equivalence assumption, we have

Theorem 6 For some $c_0 > 0$ and $0 < c_1 < 1$, we have

$$R(f) - R^* \le c_0 (R_\phi(f) - R_\phi^*)^{c_1},$$

if $(f^*(x) - f^*(x'))(\eta(x) - \eta(x')) > 0$ for $\eta(x) \neq \eta(x')$, and

$$|\eta(x) - \eta(x')| \le c_0 \big(C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big)^{c_1}.$$

The proof is motivated by Zhang [Zha04b] and deferred to Section 6.3. Based on this theorem,
we can get the following consistent bounds for the exponential loss and logistic loss:

Corollary 7 For exponential loss, it holds that $R(f) - R^* \le \sqrt{R_\phi(f) - R_\phi^*}$.

Corollary 8 For logistic loss, it holds that $R(f) - R^* \le 2\sqrt{R_\phi(f) - R_\phi^*}$.

The detailed proofs of Corollaries 7 and 8 are given in Sections 6.4 and 6.5, respectively.
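As a rough numerical illustration of how a square-root bound can arise for the exponential loss, one can check a pointwise inequality of the form $|\eta - \eta'| \le c_0 (C(\eta, \eta', 0) - \inf_\alpha C(\eta, \eta', \alpha))^{c_1}$ with $c_0 = 1$ and $c_1 = 1/2$. The sketch below assumes this form of the condition and these constants; it uses the closed form $\inf_\alpha C = 2\sqrt{ab}$ with $a = \eta(1 - \eta')$ and $b = \eta'(1 - \eta)$, and the function name and grid are illustrative.

```python
import math


def gap_at_zero(eta, eta_p):
    """C(eta, eta', 0) - inf_alpha C(eta, eta', alpha) for the exponential loss."""
    a = eta * (1 - eta_p)      # weight of exp(-alpha)
    b = eta_p * (1 - eta)      # weight of exp(+alpha)
    # C(., ., 0) = a + b, and the infimum over alpha equals 2 * sqrt(a * b).
    return max(0.0, a + b - 2.0 * math.sqrt(a * b))


grid = [k / 20 for k in range(1, 20)]  # eta values 0.05, 0.10, ..., 0.95
worst = max(abs(e - ep) - math.sqrt(gap_at_zero(e, ep)) for e in grid for ep in grid)
print(worst)  # non-positive up to rounding if the assumed condition holds on the grid
```

The inequality is tight when $\eta + \eta' = 1$, since then $\sqrt{a} + \sqrt{b} = 1$ and $|\eta - \eta'| = |\sqrt{a} - \sqrt{b}|$.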

4.2. Consistent Bounds under Non-Noisy Setting

Now we consider the non-noisy setting [RS09] defined as:

Definition 3 A distribution $\mathcal{D}$ is said to be non-noisy if it holds either $\eta(x) = 0$ or
$\eta(x) = 1$ for every $x \in \mathcal{X}$.

Under such setting, we have

Theorem 7 For some $c > 0$, we have

$$R(f) - R^* \le c\,(R_\phi(f) - R_\phi^*)$$

if $R_\phi^* = 0$, and if $\phi(t) \ge 1/c$ for $t \le 0$ and $\phi(t) \ge 0$ for $t > 0$.



Proof: For convenience, denote by $\mathcal{D}_+$ and $\mathcal{D}_-$ the positive and negative
instance distributions, respectively. From Eqn. (1), we have

$$R(f) = E_{x \sim \mathcal{D}_+, x' \sim \mathcal{D}_-}\big[I[f(x) < f(x')] + I[f(x) = f(x')]/2\big],$$

and thus $R^* = \inf_f [R(f)] = 0$, attained when $f(x) > f(x')$ for every positive instance $x$ and
negative instance $x'$. From Eqn. (3), we get the $\phi$-risk
$R_\phi(f) = E_{x \sim \mathcal{D}_+, x' \sim \mathcal{D}_-}[\phi(f(x) - f(x'))]$. Then

$$\begin{aligned}
R(f) - R^* &= E_{x \sim \mathcal{D}_+, x' \sim \mathcal{D}_-}\big[I[f(x) < f(x')] + I[f(x) = f(x')]/2\big] \\
&\le E_{x \sim \mathcal{D}_+, x' \sim \mathcal{D}_-}\big[c\,\phi(f(x) - f(x'))\big] = c\,(R_\phi(f) - R_\phi^*),
\end{aligned}$$

which completes the proof. □

Based on this theorem, we can get the following corollaries under the non-noisy setting:

Corollary 9 For exponential loss, hinge loss, general hinge loss, q-norm hinge loss, and least
square loss $\phi(t) = (1 - t)^2$, we have $R(f) - R^* \le R_\phi(f) - R_\phi^*$.

Corollary 10 For logistic loss, we have $R(f) - R^* \le 2(R_\phi(f) - R_\phi^*)$.

It is noteworthy that the hinge loss is consistent with AUC under the non-noisy setting although
it is inconsistent for the general case as shown in Theorem 2. Moreover, the consistent bounds
for the exponential loss and logistic loss under the non-noisy setting are tighter than those of
Corollaries 7 and 8, respectively.

5. Equivalence Between Surrogate Losses of AUC and Accuracy

In this section, we study the relationships among AUC, accuracy, and their surrogate loss
functions. Our results show that optimizing AUC is more difficult than optimizing accuracy. More
interestingly, we establish an equivalence between the exponential surrogate loss of AUC and the
exponential surrogate loss of accuracy regardless of their different formulations. This provides a
new explanation of the equivalence between AdaBoost and RankBoost: both of them optimize AUC
and accuracy simultaneously.

We focus on binary classification and make the prediction $\hat{y} = \mathrm{sgn}[f(x)]$. Thus,
optimizing accuracy aims to minimize

$$E_{(x, y) \sim \mathcal{D}}\big[I[y f(x) < 0]\big] = E_x\big[\eta(x) I[f(x) < 0] + (1 - \eta(x)) I[f(x) > 0]\big],$$

and it is easy to obtain the set of Bayes predictors for accuracy:

$$\mathcal{B}_{acc} = \{f\colon f(x)(\eta(x) - 1/2) > 0 \text{ for } \eta(x) \neq 1/2\}.$$

Recall the set of Bayes predictors for AUC from Eqn. (2):

$$\mathcal{B} = \{f\colon R(f) = R^*\} = \{f\colon (f(x) - f(x'))(\eta(x) - \eta(x')) > 0 \text{ if } \eta(x) \neq \eta(x')\}.$$

By comparing the two sets of Bayes predictors, we can find that optimizing accuracy tries to learn
a function $f$ s.t. $\mathrm{sgn}[f(x)] = \mathrm{sgn}[\eta(x) - 1/2]$, yet optimizing AUC aims to learn
a function which orders instances according to their conditional probability $\eta(x)$. It is easy to
construct the Bayes predictor $f^*_{acc}(x)$ of accuracy from the Bayes predictor $f^*(x)$ of AUC by
setting $f^*_{acc}(x) = f^*(x) - f^*(x_0)$ where $\eta(x_0) = 1/2$. The converse direction, however, does
not hold, because a Bayes predictor of accuracy only orders the instances $x, x' \in \mathcal{X}$
correctly when $\eta(x) > 1/2 > \eta(x')$ and may fail when $(\eta(x) - 1/2)(\eta(x') - 1/2) > 0$.
In this sense, it is more difficult to optimize AUC than accuracy.



We consider one of the most popular surrogate loss functions of accuracy:

$$\phi_{acc}(f(x), y) = \phi(y f(x)),$$

where $\phi$ is convex and non-increasing, e.g., the hinge loss $\phi(t) = \max(0, 1 - t)$ [Vap98],
exponential loss $\phi(t) = e^{-t}$ [FS97], logistic loss $\phi(t) = \ln(1 + e^{-t})$ [FHT00], etc.
We can also define the $\phi_{acc}$-risk as
$R_{\phi_{acc}}(f) = E_{\mathcal{D}}[\phi_{acc}(f(x), y)] = E_{\mathcal{D}}[\phi(y f(x))]$ for accuracy.
Since the surrogate loss $\phi_{acc}$ focuses on single instances, we have

$$\inf_f R_{\phi_{acc}}(f) = E_x\Big[\inf_{f(x)} C_{acc}(\eta(x), f(x))\Big], \tag{8}$$

where the conditional risk $C_{acc}(\eta(x), f(x)) = \eta(x)\phi(f(x)) + (1 - \eta(x))\phi(-f(x))$. In
other words, minimizing the expected risk over the whole distribution is equivalent to minimizing
the conditional risk on every instance. Thus, it is sufficient to study the consistency of accuracy
based on the conditional risk, as done in [BJM06, Zha04b]. This is quite different from our work
on AUC consistency. The surrogate loss function for AUC is defined on a pair of instances, and for
some surrogate loss functions, minimizing the expected risk over the whole distribution is not
equivalent to minimizing the conditional risk on every pair of instances, as shown by Lemma 1.
Therefore, the study of the consistency of AUC is more difficult than the consistency analysis of
accuracy.



In what follows, we will study the relationship between the surrogate loss of accuracy, $\phi_{acc}(f(x), y) = \phi(y f(x))$,
and the surrogate loss of AUC, $\Psi(f, x, x') = \phi(f(x) - f(x'))$, especially for $\phi(t) = e^{-t}$
(exponential loss). The following lemma shows that the exponential surrogate losses of accuracy
and AUC have the same optimal solution:

Lemma 4 The optimal functions of the exponential surrogate loss of accuracy $E_{(x,y)\sim\mathcal{D}}[e^{-y f(x)}]$
optimize the exponential surrogate loss of AUC

$$E_{x,x'\sim\mathcal{D}_{\mathcal{X}}}\Big[\eta(x)(1-\eta(x'))e^{-f(x)+f(x')} + \eta(x')(1-\eta(x))e^{-f(x')+f(x)}\Big],$$

and the converse direction holds by fixing $f(x_0) = 0$ for $\eta(x_0) = 1/2$.

Proof: From Lemma 3 and Eqn. (8), it suffices to proceed on conditional risk. Minimizing the
accuracy's conditional risk $\eta(x)e^{-f(x)} + (1 - \eta(x))e^{f(x)}$ gives the optimal solution
$f^*_{acc}(x) = \frac{1}{2}\ln\big(\eta(x)/(1 - \eta(x))\big)$. On the other hand, minimizing the AUC's conditional risk

$$\eta(x)(1 - \eta(x'))e^{-f(x)+f(x')} + \eta(x')(1 - \eta(x))e^{-f(x')+f(x)}$$

gives the optimal solution

$$f^*(x) - f^*(x') = \frac{1}{2}\ln\frac{\eta(x)(1 - \eta(x'))}{\eta(x')(1 - \eta(x))} = f^*_{acc}(x) - f^*_{acc}(x'),$$

which completes the proof by simple analysis. □
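The closed form in the proof of Lemma 4 can be checked numerically. The following is a minimal sketch, not part of the original analysis: it minimizes the pairwise exponential conditional risk over the score difference by ternary search and compares the result with the difference of the accuracy minimizers $\frac{1}{2}\ln(\eta/(1-\eta))$. The function names and the search interval are illustrative assumptions.

```python
import math

def f_acc_star(eta):
    """Minimizer of eta*exp(-f) + (1-eta)*exp(f), the accuracy conditional risk."""
    return 0.5 * math.log(eta / (1.0 - eta))

def auc_conditional_risk(eta, eta_p, alpha):
    """Pairwise exponential conditional risk as a function of alpha = f(x) - f(x')."""
    return (eta * (1.0 - eta_p) * math.exp(-alpha)
            + eta_p * (1.0 - eta) * math.exp(alpha))

def argmin_convex(g, lo=-20.0, hi=20.0, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi] (assumed bracket)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def check_pair(eta, eta_p):
    """Return (numeric minimizer, closed-form difference f*_acc(x) - f*_acc(x'))."""
    numeric = argmin_convex(lambda a: auc_conditional_risk(eta, eta_p, a))
    closed = f_acc_star(eta) - f_acc_star(eta_p)
    return numeric, closed
```

For conditional probabilities strictly inside $(0, 1)$ the two returned values should agree up to the precision of the search.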

A similar result also holds for logistic loss $\phi(t) = \ln(1 + e^{-t})$. Based on this lemma, we can further
derive the following theorem, whose proof is deferred to Section 6.6.

Theorem 8 For exponential loss and sequence $\{f^{(n)}\}_{n\ge1}$, we have $R_\phi(f^{(n)}) \to R^*_\phi$ if $R_{\phi_{acc}}(f^{(n)}) \to R^*_{\phi_{acc}}$;
we also have $R_{\phi_{acc}}(f^{(n)}) \to R^*_{\phi_{acc}}$ if $R_\phi(f^{(n)}) \to R^*_\phi$ by setting $f^{(n)}(x_0) = 0$ for $\eta(x_0) = 1/2$
and $n \ge 1$.

This theorem discloses the asymptotic equivalence between the exponential surrogate loss of accuracy
and the exponential surrogate loss of AUC. Thus, the accuracy's surrogate loss $\phi_{acc}(f(x), y) = e^{-y f(x)}$
is consistent with AUC, whereas the AUC's surrogate loss $\Psi(f, x, x') = e^{-(f(x) - f(x'))}$ is
consistent with accuracy by choosing a proper threshold. One direct consequence of this theorem
is that AdaBoost and RankBoost are asymptotically equivalent, i.e., both of them optimize AUC
and accuracy simultaneously in the limit of infinite training samples, because AdaBoost and RankBoost
essentially optimize the surrogate losses $\phi_{acc}(f(x), y) = e^{-y f(x)}$ and $\Psi(f, x, x') = e^{-(f(x) - f(x'))}$,
respectively. It would be interesting to carry out a similar analysis for the logistic loss, and we leave
it to future work.

6. Proofs 

In this section, we provide some detailed proofs. 
6.1. Proof of Lemma

Proof: If $\phi$ is not classification-calibrated, then there exist $\eta_0$ and $\eta_0'$ such that $\eta_0 > \eta_0'$ (without
loss of generality) and $H^-(\eta_0, \eta_0') = H(\eta_0, \eta_0')$, i.e.,

$$\inf_{\alpha\in\mathbb{R}} \big\{\eta_0(1 - \eta_0')\phi(\alpha) + \eta_0'(1 - \eta_0)\phi(-\alpha)\big\} = \inf_{\alpha:\ \alpha(\eta_0 - \eta_0')\le 0} \big\{\eta_0(1 - \eta_0')\phi(\alpha) + \eta_0'(1 - \eta_0)\phi(-\alpha)\big\}.$$

This implies that there is an $\alpha_0 \le 0$ such that

$$\eta_0(1 - \eta_0')\phi(\alpha_0) + \eta_0'(1 - \eta_0)\phi(-\alpha_0) = \inf_{\alpha\in\mathbb{R}} \big\{\eta_0(1 - \eta_0')\phi(\alpha) + \eta_0'(1 - \eta_0)\phi(-\alpha)\big\}.$$

Suppose that the instance space $\mathcal{X} = \{x_1, x_2\}$ with marginal probability $\Pr[x_1] = \Pr[x_2] = 1/2$ and
conditional probability $\eta(x_1) = \eta_0$ and $\eta(x_2) = \eta_0'$. We construct a sequence $\{f^{(n)}\}_{n\ge1}$ by
picking up $f^{(n)}(x_1) = f^{(n)}(x_2) + \alpha_0$, and it is easy to get that

$$R_\phi(f^{(n)}) \to R^*_\phi \quad\text{yet}\quad R(f^{(n)}) - R^* = (\eta_0 - \eta_0')/8 \quad\text{as } n \to \infty,$$

which implies that $\phi$ is inconsistent with AUC. Therefore, classification calibration is necessary
for AUC consistency.



For classification, Bartlett et al. [BJM06] established that, for convex $\phi$, classification calibration
is equivalent to the condition that $\phi$ is differentiable at $t = 0$ and $\phi'(0) < 0$, whereas no study is
provided to guarantee such equivalence for AUC. Therefore, we present the complete proof that,
for convex $\phi$, the condition that $\phi$ is differentiable at $t = 0$ with $\phi'(0) < 0$ is necessary for AUC
consistency.

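We consider the instance space $\mathcal{X} = \{x_1, x_2\}$ with marginal probability $\Pr[x_1] = \Pr[x_2] = 1/2$
and conditional probability $\eta(x_1) = \eta_1$ and $\eta(x_2) = \eta_2$. We first prove that if the consistent
surrogate loss $\phi$ is differentiable at $t = 0$, then $\phi'(0) < 0$. Assume $\phi'(0) \ge 0$, and for convex $\phi$, we
have

$$\begin{aligned}
\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)
&\ge (\eta_1 - \eta_2)\alpha\phi'(0) + \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0) \\
&\ge \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0) \quad\text{for } (\eta_1 - \eta_2)\alpha \ge 0.
\end{aligned}$$

Therefore, we have

$$\begin{aligned}
H(\eta_1, \eta_2) &= \inf_{\alpha\in\mathbb{R}} \big\{\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)\big\} \\
&= \min\Big\{ \inf_{(\eta_1 - \eta_2)\alpha \ge 0} \big\{\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)\big\},\ 
\inf_{(\eta_1 - \eta_2)\alpha \le 0} \big\{\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)\big\} \Big\} \\
&= \min\Big\{ \eta_1(1 - \eta_2)\phi(0) + \eta_2(1 - \eta_1)\phi(0),\ 
\inf_{(\eta_1 - \eta_2)\alpha \le 0} \big\{\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)\big\} \Big\} \\
&= \inf_{(\eta_1 - \eta_2)\alpha \le 0} \big\{\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)\big\} \\
&= H^-(\eta_1, \eta_2),
\end{aligned} \tag{9}$$

which implies that $\phi$ is not classification-calibrated. By the previous analysis, this contradicts the
assumption that $\phi$ is consistent with AUC.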
We now prove that convex loss $\phi$ must be differentiable at $t = 0$ if it is consistent with AUC.
Suppose that $\phi$ is not differentiable at $t = 0$. Then, we can find subgradients $g_1 > g_2$ such that

$$\phi(t) \ge g_1 t + \phi(0) \quad\text{and}\quad \phi(t) \ge g_2 t + \phi(0) \quad\text{for } t \in \mathbb{R},$$

and we consider the following cases:

1. If $g_1 > g_2 \ge 0$, then we choose $\eta_1 = g_1/(g_1 + g_2)$ and $\eta_2 = g_2/(g_1 + g_2)$. It is obvious that
$\eta_1 > \eta_2$, and for any $\alpha > 0$, we have

$$\begin{aligned}
\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)
&\ge \eta_1(1 - \eta_2)(g_2\alpha + \phi(0)) + \eta_2(1 - \eta_1)(-g_1\alpha + \phi(0)) \\
&= (\eta_1 g_2 - \eta_2 g_1)\alpha + (g_1 - g_2)\eta_1\eta_2\alpha + \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0) \\
&= (g_1 - g_2)\eta_1\eta_2\alpha + \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0) \\
&\ge \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0);
\end{aligned}$$

2. If $g_1 > 0 \ge g_2$ or $g_1 \ge 0 > g_2$, then we choose $\eta_1 = 1$ and $\eta_2 = 1/2$, and for any $\alpha > 0$, it
holds that

$$\begin{aligned}
\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)
&\ge \eta_1(1 - \eta_2)(g_1\alpha + \phi(0)) + \eta_2(1 - \eta_1)(-g_2\alpha + \phi(0)) \\
&= g_1\alpha/2 + \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0) \\
&\ge \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0);
\end{aligned}$$

3. If $0 > g_1 > g_2$, then we choose $\eta_1 = (|g_1| + |g_1 - g_2|/2)/|g_1 + g_2|$ and $\eta_2 = |g_1|/|g_1 + g_2|$.
It is obvious that $\eta_1 > \eta_2$ and for any $\alpha > 0$, we have

$$\begin{aligned}
\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha)
&\ge \eta_1(1 - \eta_2)(g_1\alpha + \phi(0)) + \eta_2(1 - \eta_1)(-g_2\alpha + \phi(0)) \\
&= (\eta_1 g_1 - \eta_2 g_2)\alpha + (g_2 - g_1)\eta_1\eta_2\alpha + \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0) \\
&= \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0).
\end{aligned}$$

Therefore, for any $g_1$ and $g_2$, there exist $\eta_1$ and $\eta_2$ such that

$$\eta_1(1 - \eta_2)\phi(\alpha) + \eta_2(1 - \eta_1)\phi(-\alpha) \ge \big(\eta_1(1 - \eta_2) + \eta_2(1 - \eta_1)\big)\phi(0) \quad\text{for } (\eta_1 - \eta_2)\alpha \ge 0.$$

Similarly to Eqn. (9), we have $H(\eta_1, \eta_2) = H^-(\eta_1, \eta_2)$, and thus $\phi$ is inconsistent with AUC,
which is contrary to the assumption. This completes the proof as desired. □
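The three-case construction above reduces to an arithmetic fact: for each pair of subgradients $g_1 > g_2$, the chosen $(\eta_1, \eta_2)$ makes the coefficient of $\alpha$ in the linear lower bound non-negative. The sketch below is illustrative only (the function name `choose_etas` and the case boundaries at zero are assumptions made for this example); it computes the chosen probabilities and that coefficient.

```python
def choose_etas(g1, g2):
    """Return (eta1, eta2, slope) for subgradients g1 > g2 of a convex loss at 0.

    `slope` is the coefficient of alpha (alpha > 0) in the linear lower bound of
    eta1*(1-eta2)*phi(alpha) + eta2*(1-eta1)*phi(-alpha) used in each case.
    """
    if not g1 > g2:
        raise ValueError("expects g1 > g2")
    if g2 >= 0:
        # Case 1: g1 > g2 >= 0; bound phi(alpha) with g2 and phi(-alpha) with g1.
        eta1, eta2 = g1 / (g1 + g2), g2 / (g1 + g2)
        slope = eta1 * (1 - eta2) * g2 - eta2 * (1 - eta1) * g1
    elif g1 >= 0:
        # Case 2: g1 >= 0 > g2; bound phi(alpha) with g1 and phi(-alpha) with g2.
        eta1, eta2 = 1.0, 0.5
        slope = eta1 * (1 - eta2) * g1 - eta2 * (1 - eta1) * g2
    else:
        # Case 3: 0 > g1 > g2; same bounds as case 2, different probabilities.
        denom = abs(g1 + g2)
        eta1 = (abs(g1) + abs(g1 - g2) / 2) / denom
        eta2 = abs(g1) / denom
        slope = eta1 * (1 - eta2) * g1 - eta2 * (1 - eta1) * g2
    return eta1, eta2, slope
```

In case 3 the choice always gives $\eta_1 = 1/2$, and the slope vanishes exactly, which is why the last line of that case is an equality.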

6.2. Proof of Theorem 5

We begin with the following lemma, which is crucial to the proof of Theorem 5.

Lemma 5 For surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$, it holds that

$$\inf_{f\notin\mathcal{B}} R_\phi(f) > \inf_f R_\phi(f)$$

if $\phi: \mathbb{R} \to \mathbb{R}$ is a convex, differentiable and non-increasing function with $\phi'(0) < 0$.

Proof: From the $\phi$-risk's definition in Eqn. (3), we have

$$R_\phi(f) = C_0 + \sum_{x,x'\in\mathcal{X}} \Pr[x]\Pr[x']\Big(\eta(x)(1 - \eta(x'))\phi(f(x) - f(x')) + \eta(x')(1 - \eta(x))\phi(f(x') - f(x))\Big),$$

where $C_0$ is a constant with respect to $f$. We proceed by contradiction, and suppose that
$\inf_{f\notin\mathcal{B}} R_\phi(f) = \inf_f R_\phi(f)$.

This implies that there exists an optimal function $f^*$ such that $R_\phi(f^*) = \inf_f R_\phi(f)$ and $f^* \notin \mathcal{B}$,
i.e., for some $x_1, x_2 \in \mathcal{X}$, it holds that $f^*(x_1) \le f^*(x_2)$ yet $\eta(x_1) > \eta(x_2)$.

Since $\phi$ is convex and differentiable, the subgradient conditions for minimizing $R_\phi(f)$ give

$$\left[\frac{\partial R_\phi(f)}{\partial f(x_1)}\right]_{f(x_1) = f^*(x_1)} = 0 \quad\text{and}\quad \left[\frac{\partial R_\phi(f)}{\partial f(x_2)}\right]_{f(x_2) = f^*(x_2)} = 0,$$

which are equivalent to

$$\sum_{x \ne x_1} \Pr[x]\Big(\eta(x_1)(1 - \eta(x))\phi'(f^*(x_1) - f^*(x)) - \eta(x)(1 - \eta(x_1))\phi'(f^*(x) - f^*(x_1))\Big) = 0,$$

$$\sum_{x \ne x_2} \Pr[x]\Big(\eta(x_2)(1 - \eta(x))\phi'(f^*(x_2) - f^*(x)) - \eta(x)(1 - \eta(x_2))\phi'(f^*(x) - f^*(x_2))\Big) = 0.$$

Subtracting the second equation from the first gives

$$\begin{aligned}
&(\Pr[x_1] + \Pr[x_2])\Big(\eta(x_1)(1 - \eta(x_2))\phi'(f^*(x_1) - f^*(x_2)) - \eta(x_2)(1 - \eta(x_1))\phi'(f^*(x_2) - f^*(x_1))\Big) \\
&\quad + \sum_{x \ne x_1, x_2} \Pr[x]\,\eta(x)\Big((1 - \eta(x_2))\phi'(f^*(x) - f^*(x_2)) - (1 - \eta(x_1))\phi'(f^*(x) - f^*(x_1))\Big) \\
&\quad + \sum_{x \ne x_1, x_2} \Pr[x]\,(1 - \eta(x))\Big(\eta(x_1)\phi'(f^*(x_1) - f^*(x)) - \eta(x_2)\phi'(f^*(x_2) - f^*(x))\Big) = 0.
\end{aligned} \tag{10}$$

Since $\phi$ is convex, differentiable and non-increasing, we have $\phi'(t_1) \le \phi'(t_2) \le 0$ when $t_1 \le t_2$.
Therefore, it holds that $\phi'(f^*(x_1) - f^*(x)) \le \phi'(f^*(x_2) - f^*(x)) \le 0$ if $f^*(x_1) \le f^*(x_2)$. It
follows that

$$\eta(x_1)\phi'(f^*(x_1) - f^*(x)) - \eta(x_2)\phi'(f^*(x_2) - f^*(x)) \le 0 \tag{11}$$

for $\eta(x_1) > \eta(x_2)$. In a similar manner, we have

$$(1 - \eta(x_2))\phi'(f^*(x) - f^*(x_2)) - (1 - \eta(x_1))\phi'(f^*(x) - f^*(x_1)) \le 0. \tag{12}$$

For the case $f^*(x_1) = f^*(x_2)$, we have

$$\eta(x_1)(1 - \eta(x_2))\phi'(f^*(x_1) - f^*(x_2)) - \eta(x_2)(1 - \eta(x_1))\phi'(f^*(x_2) - f^*(x_1)) = (\eta(x_1) - \eta(x_2))\phi'(0) < 0$$

from $\phi'(0) < 0$ and $\eta(x_1) > \eta(x_2)$, which is contrary to Eqn. (10) by combining Eqns. (11) and (12).

For the case $f^*(x_1) < f^*(x_2)$, we have $\phi'(f^*(x_1) - f^*(x_2)) \le \phi'(0) < 0$ and
$\phi'(f^*(x_1) - f^*(x_2)) \le \phi'(f^*(x_2) - f^*(x_1)) \le 0$. It follows that, for $\eta(x_1) > \eta(x_2)$,

$$\eta(x_1)(1 - \eta(x_2))\phi'(f^*(x_1) - f^*(x_2)) - \eta(x_2)(1 - \eta(x_1))\phi'(f^*(x_2) - f^*(x_1)) < 0,$$

which is also contrary to Eqn. (10) by combining Eqns. (11) and (12). Hence, this lemma follows
as desired. □

Proof of Theorem 5. From Lemma 5, we set

$$\delta = \inf_{f\notin\mathcal{B}} R_\phi(f) - \inf_f R_\phi(f) > 0.$$

Let $\{f^{(n)}\}_{n\ge0}$ be any sequence such that $R_\phi(f^{(n)}) \to R^*_\phi$. Then, there exists an integer $N_0 > 0$
such that

$$R_\phi(f^{(n)}) - R^*_\phi < \delta/2 \quad\text{for } n \ge N_0.$$

This immediately yields that $f^{(n)} \in \mathcal{B}$ for $n \ge N_0$, since otherwise

$$R_\phi(f) - R^*_\phi = R_\phi(f) - \inf_{f'\notin\mathcal{B}} R_\phi(f') + \inf_{f'\notin\mathcal{B}} R_\phi(f') - R^*_\phi \ge \delta \quad\text{if } f \notin \mathcal{B}.$$

Therefore, we have $R(f^{(n)}) = R^*$ for $n \ge N_0$, which completes the proof. □

6.3. Proof of Theorem 6

From Eqns. (1) and (2), we have

$$\begin{aligned}
R(f) - R^* &= E_{\eta(x)>\eta(x'),\, f(x)<f(x')}[\eta(x) - \eta(x')] + E_{\eta(x)>\eta(x'),\, f(x)=f(x')}[\eta(x)/2 - \eta(x')/2] \\
&\quad + E_{\eta(x)<\eta(x'),\, f(x)>f(x')}[\eta(x') - \eta(x)] + E_{\eta(x)<\eta(x'),\, f(x)=f(x')}[\eta(x')/2 - \eta(x)/2] \\
&= E_{(\eta(x)-\eta(x'))(f(x)-f(x'))<0}\big[|\eta(x) - \eta(x')|\big] + \tfrac{1}{2} E_{f(x)=f(x')}\big[|\eta(x') - \eta(x)|\big] \\
&\le E_{(\eta(x)-\eta(x'))(f(x)-f(x'))\le0}\big[|\eta(x) - \eta(x')|\big] \\
&\le E_{(\eta(x)-\eta(x'))(f(x)-f(x'))\le0}\Big[c_0\big(C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big)^{c_1}\Big],
\end{aligned}$$

where the last inequality holds from our assumption. By using Jensen's inequality, we further
obtain

$$R(f) - R^* \le c_0\Big(E_{(\eta(x)-\eta(x'))(f(x)-f(x'))\le0}\big[C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big]\Big)^{c_1}$$

for $0 < c_1 \le 1$. It remains to prove that

$$\begin{aligned}
&E_{(\eta(x)-\eta(x'))(f(x)-f(x'))\le0}\big[C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big] \\
&\quad\le E_{(\eta(x)-\eta(x'))(f(x)-f(x'))\le0}\big[C(\eta(x), \eta(x'), f(x) - f(x')) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big] \\
&\quad\le R_\phi(f) - R^*_\phi.
\end{aligned}$$

To see it, we consider the following cases:

• If $\eta(x) = \eta(x')$ then $C(\eta(x), \eta(x'), 0) \le C(\eta(x), \eta(x'), f(x) - f(x'))$ since $\phi$ is convex;

• If $f(x) = f(x')$ then $C(\eta(x), \eta(x'), 0) = C(\eta(x), \eta(x'), f(x) - f(x'))$;

• If $(\eta(x) - \eta(x'))(f(x) - f(x')) < 0$, then $(f(x) - f(x'))(f^*(x) - f^*(x')) < 0$ from the
assumption $(f^*(x) - f^*(x'))(\eta(x) - \eta(x')) > 0$. Thus, $0$ is between $f(x) - f(x')$ and $f^*(x) - f^*(x')$,
and for convex function $\phi$, we have

$$C(\eta(x), \eta(x'), 0) \le \max\big(C(\eta(x), \eta(x'), f(x) - f(x')),\ C(\eta(x), \eta(x'), f^*(x) - f^*(x'))\big) = C(\eta(x), \eta(x'), f(x) - f(x')).$$

Therefore, this theorem follows as desired. □
6.4. Proof of Corollary [?]

For exponential loss $\phi(t) = e^{-t}$, we have the optimal function $f^*$ such that

$$f^*(x) - f^*(x') = \frac{1}{2}\ln\frac{\eta(x)(1 - \eta(x'))}{\eta(x')(1 - \eta(x))} \tag{13}$$

by minimizing the conditional risk $C(\eta(x), \eta(x'), f(x) - f(x'))$, and it follows that

$$(f^*(x) - f^*(x'))(\eta(x) - \eta(x')) > 0 \quad\text{for } \eta(x) \ne \eta(x').$$

From Eqn. (13), we have

$$C(\eta(x), \eta(x'), f^*(x) - f^*(x')) = 2\sqrt{\eta(x)\eta(x')(1 - \eta(x'))(1 - \eta(x))},$$

and it is easy to get $C(\eta(x), \eta(x'), 0) = \eta(x)(1 - \eta(x')) + \eta(x')(1 - \eta(x))$. Therefore, we have

$$\begin{aligned}
C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x'))
&= \Big(\sqrt{\eta(x)(1 - \eta(x'))} - \sqrt{\eta(x')(1 - \eta(x))}\Big)^2 \\
&= \frac{|\eta(x) - \eta(x')|^2}{\Big(\sqrt{\eta(x)(1 - \eta(x'))} + \sqrt{\eta(x')(1 - \eta(x))}\Big)^2} \\
&\ge |\eta(x) - \eta(x')|^2,
\end{aligned}$$

where the last inequality holds from $\eta(x), \eta(x') \in [0, 1]$. Hence, this corollary holds by applying
Theorem 6 to exponential loss. □
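The final inequality for exponential loss can be sanity-checked on a grid. The snippet below is a small illustrative sketch (helper names are assumptions): it evaluates the gap $C(\eta, \eta', 0) - C(\eta, \eta', f^*(x) - f^*(x'))$ from the closed forms in this proof and compares it with $|\eta - \eta'|^2$.

```python
import math

def exp_gap(eta, eta_p):
    """C(eta, eta', 0) - C(eta, eta', f*(x) - f*(x')) for exponential loss."""
    a = eta * (1.0 - eta_p)      # weight of exp(-alpha)
    b = eta_p * (1.0 - eta)      # weight of exp(+alpha)
    c_zero = a + b               # conditional risk at alpha = 0
    c_star = 2.0 * math.sqrt(a * b)  # minimal conditional risk
    return c_zero - c_star

def worst_margin(n=50):
    """Smallest value of gap - (eta - eta')^2 over an (n+1) x (n+1) grid on [0, 1]^2."""
    grid = [i / n for i in range(n + 1)]
    return min(exp_gap(e, ep) - (e - ep) ** 2 for e in grid for ep in grid)
```

A value of `worst_margin()` that is non-negative up to floating-point rounding is what the bound predicts, because $\sqrt{\eta(1-\eta')} + \sqrt{\eta'(1-\eta)} \le 1$ by the Cauchy-Schwarz inequality.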
6.5. Proof of Corollary

For logistic loss $\phi(t) = \ln(1 + e^{-t})$, we have the optimal function $f^*$ such that

$$f^*(x) - f^*(x') = \ln\frac{\eta(x)(1 - \eta(x'))}{\eta(x')(1 - \eta(x))} \tag{14}$$

by minimizing the conditional risk $C(\eta(x), \eta(x'), f(x) - f(x'))$, and this immediately yields

$$(f^*(x) - f^*(x'))(\eta(x) - \eta(x')) > 0 \quad\text{for } \eta(x) \ne \eta(x').$$

Therefore, we complete the proof by applying Theorem 6 to logistic loss if the following holds:

$$C(\eta(x), \eta(x'), 0) - C(\eta(x), \eta(x'), f^*(x) - f^*(x')) \ge |\eta(x) - \eta(x')|^2/4. \tag{15}$$

We will prove that Eqn. (15) holds for $|\eta(x') - 0.5| \le |\eta(x) - 0.5|$, and a similar derivation could
be made when $|\eta(x') - 0.5| > |\eta(x) - 0.5|$. For notational simplicity, we denote $\eta = \eta(x)$ and
$\eta' = \eta(x')$. Fix $\eta'$ and we set

$$F(\eta) = C(\eta, \eta', 0) - C(\eta, \eta', f^*(x) - f^*(x')) - (\eta - \eta')^2/4.$$

From Eqn. (14), we further get

$$F(\eta) = \ln(2)(\eta + \eta' - 2\eta'\eta) - (\eta - \eta')^2/4 - \eta(1 - \eta')\ln\Big(1 + \frac{\eta'(1 - \eta)}{\eta(1 - \eta')}\Big) - \eta'(1 - \eta)\ln\Big(1 + \frac{\eta(1 - \eta')}{\eta'(1 - \eta)}\Big).$$

It is easy to obtain $F(\eta') = 0$ and the derivative

$$F'(\eta) = \ln(2)(1 - 2\eta') - (\eta - \eta')/2 - (1 - \eta')\ln\Big(1 + \frac{\eta'(1 - \eta)}{\eta(1 - \eta')}\Big) + \eta'\ln\Big(1 + \frac{\eta(1 - \eta')}{\eta'(1 - \eta)}\Big).$$

Further, we have $F'(\eta') = 0$ and the second-order derivative

$$F''(\eta) = \frac{\eta'(1 - \eta')}{\eta(1 - \eta)(\eta + \eta' - 2\eta\eta')} - \frac{1}{2} \ge 0,$$

where the inequality holds since $\eta + \eta' - 2\eta\eta' = \eta(1 - \eta') + \eta'(1 - \eta) \le 2$ and $\eta'(1 - \eta') \ge \eta(1 - \eta)$
from the assumption $|\eta' - 0.5| \le |\eta - 0.5|$. Therefore, $F'(\eta)$ is a non-decreasing function, and this
yields that

$$F'(\eta) \le F'(\eta') = 0 \ \text{ for } \eta \le \eta', \quad\text{and}\quad F'(\eta) \ge F'(\eta') = 0 \ \text{ for } \eta \ge \eta',$$

which implies that $F(\eta) \ge F(\eta') = 0$. Therefore, we complete the proof. □
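As a numerical companion to Eqn. (15), the sketch below evaluates the logistic-loss gap $C(\eta, \eta', 0) - C(\eta, \eta', f^*(x) - f^*(x'))$ on an open grid and compares it with $|\eta - \eta'|^2/4$. It is an illustration under the closed forms stated in this proof, not a substitute for the argument; the function names and grid size are assumptions.

```python
import math

def logistic_gap(eta, eta_p):
    """C(eta, eta', 0) - C(eta, eta', f*(x) - f*(x')) for logistic loss, eta, eta' in (0, 1)."""
    a = eta * (1.0 - eta_p)      # weight of ln(1 + exp(-alpha))
    b = eta_p * (1.0 - eta)      # weight of ln(1 + exp(+alpha))
    s = a + b
    c_zero = s * math.log(2.0)   # conditional risk at alpha = 0
    # At the minimizer exp(alpha) = a / b, so the two log terms are ln(s/a) and ln(s/b).
    c_star = a * math.log(s / a) + b * math.log(s / b)
    return c_zero - c_star

def worst_margin(n=100):
    """Smallest value of gap - (eta - eta')^2 / 4 over an open grid in (0, 1)^2."""
    grid = [i / n for i in range(1, n)]
    return min(logistic_gap(e, ep) - (e - ep) ** 2 / 4.0 for e in grid for ep in grid)
```

The endpoints $0$ and $1$ are excluded from the grid only to avoid evaluating $\ln$ at a zero weight.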
6.6. Proof of Theorem

We first introduce a lemma for exponential loss as follows:

Lemma 6 For some $c_0 > 0$, we have

$$R_\phi(f) - R^*_\phi \le 4c_0\big(R_{\phi_{acc}}(f) - R^*_{\phi_{acc}}\big) \tag{16}$$

if $E_x[(1 - \eta(x))e^{f(x)}] \le c_0$; we also have

$$R_{\phi_{acc}}(f) - R^*_{\phi_{acc}} \le 2\sqrt{R_\phi(f) - R^*_\phi} \tag{17}$$

if $E_x[\eta(x)e^{-f(x)}] = E_x[(1 - \eta(x))e^{f(x)}]$.



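Proof: For accuracy's exponential surrogate loss, we have

$$R_{\phi_{acc}}(f) - R^*_{\phi_{acc}} = E_x\Big[\eta(x)e^{-f(x)} + (1 - \eta(x))e^{f(x)} - 2\sqrt{\eta(x)(1 - \eta(x))}\Big] = E_x\Big[\Big(\sqrt{\eta(x)e^{-f(x)}} - \sqrt{(1 - \eta(x))e^{f(x)}}\Big)^2\Big], \tag{18}$$

and a similar result holds for AUC's exponential surrogate loss as follows:

$$R_\phi(f) - R^*_\phi = E_{x,x'}\Big[\Big(\sqrt{\eta(x)(1 - \eta(x'))e^{-f(x)+f(x')}} - \sqrt{\eta(x')(1 - \eta(x))e^{f(x)-f(x')}}\Big)^2\Big]. \tag{19}$$

For Eqn. (16), we have

$$\begin{aligned}
R_\phi(f) - R^*_\phi &\le 2E_{x'}\big[(1 - \eta(x'))e^{f(x')}\big]\, E_x\Big[\Big(\sqrt{\eta(x)e^{-f(x)}} - \sqrt{(1 - \eta(x))e^{f(x)}}\Big)^2\Big] \\
&\quad + 2E_x\big[(1 - \eta(x))e^{f(x)}\big]\, E_{x'}\Big[\Big(\sqrt{(1 - \eta(x'))e^{f(x')}} - \sqrt{\eta(x')e^{-f(x')}}\Big)^2\Big]
\end{aligned}$$

by using the fact

$$\begin{aligned}
&\Big(\sqrt{\eta(x)(1 - \eta(x'))e^{-f(x)+f(x')}} - \sqrt{\eta(x')(1 - \eta(x))e^{f(x)-f(x')}}\Big)^2 \\
&\quad\le 2(1 - \eta(x'))e^{f(x')}\Big(\sqrt{\eta(x)e^{-f(x)}} - \sqrt{(1 - \eta(x))e^{f(x)}}\Big)^2 \\
&\qquad + 2(1 - \eta(x))e^{f(x)}\Big(\sqrt{(1 - \eta(x'))e^{f(x')}} - \sqrt{\eta(x')e^{-f(x')}}\Big)^2.
\end{aligned}$$

Therefore, Eqn. (16) holds by using $E_x[(1 - \eta(x))e^{f(x)}] \le c_0$.

From Eqn. (18), we have

$$\big(R_{\phi_{acc}}(f) - R^*_{\phi_{acc}}\big)^2 \le E_{x,x'}\Big[\Big(\Big(\sqrt{\eta(x)e^{-f(x)}} - \sqrt{(1 - \eta(x))e^{f(x)}}\Big)\Big(\sqrt{\eta(x')e^{-f(x')}} + \sqrt{(1 - \eta(x'))e^{f(x')}}\Big)\Big)^2\Big].$$

By using $(a + b)^2 \le 2(a^2 + b^2)$, we further get

$$\begin{aligned}
\big(R_{\phi_{acc}}(f) - R^*_{\phi_{acc}}\big)^2 &\le 2E_{x,x'}\Big[\Big(\sqrt{\eta(x)(1 - \eta(x'))e^{-f(x)+f(x')}} - \sqrt{\eta(x')(1 - \eta(x))e^{f(x)-f(x')}}\Big)^2\Big] \\
&\quad + 2E_{x,x'}\Big[\Big(\sqrt{\eta(x)\eta(x')e^{-f(x)-f(x')}} - \sqrt{(1 - \eta(x))(1 - \eta(x'))e^{f(x)+f(x')}}\Big)^2\Big].
\end{aligned}$$

We complete the proof of Eqn. (17) since the second term in the above is equal to $2(R_\phi(f) - R^*_\phi)$
from $E_x[\eta(x)e^{-f(x)}] = E_x[(1 - \eta(x))e^{f(x)}]$. The lemma follows as desired. □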
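The two bounds of Lemma 6 can be checked on a finite instance space. The sketch below is illustrative only and rests on stated assumptions: the excess risks are computed through the squared-difference forms of Eqns. (18) and (19), pairs $(x, x')$ are drawn independently, the constant $c_0$ is taken to be exactly $E_x[(1-\eta(x))e^{f(x)}]$, and the bounds checked are $R_\phi(f) - R^*_\phi \le 4c_0(R_{\phi_{acc}}(f) - R^*_{\phi_{acc}})$ and, after a constant shift of $f$ that balances $E_x[\eta(x)e^{-f(x)}]$ and $E_x[(1-\eta(x))e^{f(x)}]$, $R_{\phi_{acc}}(f) - R^*_{\phi_{acc}} \le 2\sqrt{R_\phi(f) - R^*_\phi}$. All function names are hypothetical.

```python
import math
import random

def exp_risk_gaps(p, eta, f):
    """Return (accuracy gap, AUC gap, E[(1-eta)e^f], E[eta e^-f]) on a finite space."""
    u = [math.sqrt(e * math.exp(-s)) for e, s in zip(eta, f)]
    v = [math.sqrt((1.0 - e) * math.exp(s)) for e, s in zip(eta, f)]
    n = len(p)
    acc_gap = sum(p[i] * (u[i] - v[i]) ** 2 for i in range(n))
    auc_gap = sum(p[i] * p[j] * (u[i] * v[j] - u[j] * v[i]) ** 2
                  for i in range(n) for j in range(n))
    e_neg = sum(p[i] * v[i] ** 2 for i in range(n))
    e_pos = sum(p[i] * u[i] ** 2 for i in range(n))
    return acc_gap, auc_gap, e_neg, e_pos

def balanced(p, eta, f):
    """Shift f by a constant so that E[eta e^-f] equals E[(1-eta) e^f]."""
    _, _, e_neg, e_pos = exp_risk_gaps(p, eta, f)
    shift = 0.5 * math.log(e_pos / e_neg)
    return [s + shift for s in f]

def random_instance(k, seed):
    """Random marginals, conditional probabilities in (0, 1), and scores."""
    rng = random.Random(seed)
    w = [rng.random() + 0.1 for _ in range(k)]
    total = sum(w)
    p = [x / total for x in w]
    eta = [rng.uniform(0.05, 0.95) for _ in range(k)]
    f = [rng.uniform(-2.0, 2.0) for _ in range(k)]
    return p, eta, f
```

The pairwise gap depends only on score differences, so the balancing shift leaves it unchanged while altering the accuracy gap.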



Proof of Theorem 8. From Eqn. (18), we have

$$\sqrt{\eta(x)e^{-f^{(n)}(x)}} - \sqrt{(1 - \eta(x))e^{f^{(n)}(x)}} \to 0$$

almost surely as $n \to \infty$ if $R_{\phi_{acc}}(f^{(n)}) \to R^*_{\phi_{acc}}$. It follows that $E_x[(1 - \eta(x))e^{f^{(n)}(x)}] \le 1$ as
$n \to \infty$, and we complete the first part of Theorem 8 from Eqn. (16).

From Eqn. (19), we have

$$\sqrt{\eta(x)(1 - \eta(x'))e^{-f^{(n)}(x)+f^{(n)}(x')}} - \sqrt{\eta(x')(1 - \eta(x))e^{f^{(n)}(x)-f^{(n)}(x')}} \to 0$$

almost surely as $n \to \infty$ if $R_\phi(f^{(n)}) \to R^*_\phi$. It follows that $E_x[\eta(x)e^{-f(x)}] = E_x[(1 - \eta(x))e^{f(x)}]$
when $f^{(n)}(x_0) = 0$ for $\eta(x_0) = 0.5$. This completes the second part of Theorem 8 from Eqn. (17).
□



7. Conclusion and Open Problems 

AUC (area under ROC curve) is a popular evaluation criterion widely used in diverse learning
tasks. Many convex surrogate losses have been explored to optimize AUC owing to its non-convexity
and discontinuousness. Therefore, it is important to study the consistency of learning algorithms
based on surrogate losses.
A previous study showed that classification calibration is equivalent to AUC consistency, whereas we
find that it ignores an important prerequisite: for the pairwise surrogate loss of AUC, minimizing
the expected risk over the whole distribution is not equivalent to minimizing the conditional risk
on each pair of instances. We disclose that classification calibration is necessary yet insufficient for
AUC consistency, e.g., hinge loss and absolute loss are classification-calibrated whereas they are
inconsistent with AUC. We provide a new sufficient condition for the asymptotic consistency of
learning approaches based on surrogate loss functions, and based on this finding, many surrogate
losses are proven to be consistent, such as exponential loss, logistic loss, least-square hinge loss,
etc. We also derive the consistent bounds for exponential loss and logistic loss, and obtain the
consistent bounds for many surrogate loss functions under the non-noise setting. Furthermore, we
disclose an equivalence between the exponential surrogate loss of AUC and exponential surrogate
loss of accuracy, and one straightforward consequence of this finding is that AdaBoost and
RankBoost are equivalent.

Many problems are left to future work. For example, the first open problem is to study the
necessity of the condition that $\phi$ is non-increasing in Theorem 5. It is natural to consider the
least square loss $\phi(t) = (1 - t)^2$, which is convex, differentiable with $\phi'(0) < 0$, yet increasing for
$t > 1$. Actually, it is difficult to study the consistency of least square loss, and let us see some
simple cases:

• If $\mathcal{X} = \{x_1, x_2\}$ with marginal probability $\Pr[x_i]$ and conditional probability $\eta(x_i)$ ($i = 1, 2$),
then minimizing $R_\phi(f)$ gives the optimal solution $f = (f(x_1), f(x_2))$ s.t.

$$f(x_1) - f(x_2) = \mathrm{sgn}(\eta(x_1) - \eta(x_2)) \quad\text{for } \eta(x_1) \ne \eta(x_2),$$

which implies least square loss is consistent with AUC when $\mathcal{X} = \{x_1, x_2\}$.

• If $\mathcal{X} = \{x_1, x_2, x_3\}$ with marginal probability $\Pr[x_i] = p_i$ and conditional probability
$\eta(x_i) = \eta_i$ ($1 \le i \le 3$), then minimizing $R_\phi(f)$ gives the optimal solution
$f = (f(x_1), f(x_2), f(x_3)) = (f_1, f_2, f_3)$ such that

$$f_1 - f_2 = (\eta_1 - \eta_2)\big(p_1(\eta_1 + \eta_3 - 2\eta_1\eta_3) + p_2(\eta_2 + \eta_3 - 2\eta_2\eta_3) + 2p_3(\eta_3 - \eta_3^2)\big)/\Delta$$

$$f_1 - f_3 = (\eta_1 - \eta_3)\big(p_1(\eta_1 + \eta_2 - 2\eta_1\eta_2) + 2p_2(\eta_2 - \eta_2^2) + p_3(\eta_2 + \eta_3 - 2\eta_2\eta_3)\big)/\Delta$$

$$f_2 - f_3 = (\eta_2 - \eta_3)\big(2p_1(\eta_1 - \eta_1^2) + p_2(\eta_1 + \eta_2 - 2\eta_1\eta_2) + p_3(\eta_1 + \eta_3 - 2\eta_1\eta_3)\big)/\Delta,$$

where $\Delta = p_1(\eta_1 + \eta_2 - 2\eta_1\eta_2)(\eta_1 + \eta_3 - 2\eta_1\eta_3) + p_2(\eta_1 + \eta_2 - 2\eta_1\eta_2)(\eta_2 + \eta_3 - 2\eta_2\eta_3) +
p_3(\eta_1 + \eta_3 - 2\eta_1\eta_3)(\eta_2 + \eta_3 - 2\eta_2\eta_3)$. Therefore, least square loss is consistent with AUC
when $\mathcal{X} = \{x_1, x_2, x_3\}$.

• For $\mathcal{X} = \{x_1, x_2, \ldots, x_k\}$ ($k = 4$ or $k = 5$), we also find the optimal solution $f$ s.t.

$$f(x_i) - f(x_j) = (\eta(x_i) - \eta(x_j))\Delta_{ij} \quad\text{for } i \ne j,$$

where $\Delta_{ij} > 0$ have very complicated expressions and we omit them here. Therefore, least
square loss is also consistent with AUC.
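The three-instance closed form can be verified against the first-order optimality conditions. The sketch below is a hedged illustration: it assumes the pairwise risk has the form $\sum_{x,x'} \Pr[x]\Pr[x'](\eta(x)(1-\eta(x'))\phi(f(x)-f(x')) + \eta(x')(1-\eta(x))\phi(f(x')-f(x)))$ up to a constant, so that for $\phi(t) = (1-t)^2$ stationarity in $f_i$ reads $\sum_{j \ne i} p_j\big((\eta_i + \eta_j - 2\eta_i\eta_j)(f_i - f_j) - (\eta_i - \eta_j)\big) = 0$. Function names are hypothetical.

```python
def s(a, b):
    """Shorthand for a + b - 2ab = a(1-b) + b(1-a)."""
    return a + b - 2.0 * a * b

def three_point_solution(p, eta):
    """Closed-form score differences (f1-f2, f1-f3, f2-f3) for least square loss."""
    p1, p2, p3 = p
    e1, e2, e3 = eta
    s12, s13, s23 = s(e1, e2), s(e1, e3), s(e2, e3)
    delta = p1 * s12 * s13 + p2 * s12 * s23 + p3 * s13 * s23
    d12 = (e1 - e2) * (p1 * s13 + p2 * s23 + 2 * p3 * (e3 - e3 ** 2)) / delta
    d13 = (e1 - e3) * (p1 * s12 + 2 * p2 * (e2 - e2 ** 2) + p3 * s23) / delta
    d23 = (e2 - e3) * (2 * p1 * (e1 - e1 ** 2) + p2 * s12 + p3 * s13) / delta
    return d12, d13, d23

def stationarity_residuals(p, eta, d12, d13, d23):
    """Residuals of the assumed first-order conditions in f1, f2, f3."""
    p1, p2, p3 = p
    e1, e2, e3 = eta
    s12, s13, s23 = s(e1, e2), s(e1, e3), s(e2, e3)
    r1 = p2 * (s12 * d12 - (e1 - e2)) + p3 * (s13 * d13 - (e1 - e3))
    r2 = p1 * (-s12 * d12 - (e2 - e1)) + p3 * (s23 * d23 - (e2 - e3))
    r3 = p1 * (-s13 * d13 - (e3 - e1)) + p2 * (-s23 * d23 - (e3 - e2))
    return r1, r2, r3
```

For example, calling `three_point_solution((0.2, 0.5, 0.3), (0.9, 0.6, 0.1))` and passing the result to `stationarity_residuals` should give residuals that vanish up to rounding, with every score difference carrying the sign of the corresponding $\eta_i - \eta_j$.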



For more general cases, it is reasonable to conjecture that the optimal solution $f$ has the form of
$f(x_i) - f(x_j) = (\eta(x_i) - \eta(x_j))\Delta_{ij}$, which shows that least square loss is consistent with AUC,
whereas we fail to prove it and suggest it as a conjecture:

Conjecture 1 For least square loss $\phi(t) = (1 - t)^2$, the surrogate loss $\Psi(f, x, x') = \phi(f(x) - f(x'))$
is consistent with AUC.



Another relevant loss function $\phi(t) = |1 - t|^5$ has been suggested by Breiman [Bre99] to design
the boosting algorithm arc-x4, and it also remains open to study the consistency of surrogate
loss $\Psi(f, x, x') = |1 - (f(x) - f(x'))|^5$ with respect to AUC.

For AUC consistency, we have presented a necessary condition (Lemma 2), i.e., a consistent and
convex surrogate loss $\phi(t)$ must be differentiable at $t = 0$ with $\phi'(0) < 0$; on the other hand, we have
also given a sufficient condition (Theorem 5), i.e., surrogate loss $\phi$ is consistent with AUC if it
is differentiable, convex and non-increasing with $\phi'(0) < 0$. Therefore, an interesting direction is to fill
the gap between the sufficient condition and the necessary condition. It seems difficult to establish
the necessity of the condition that $\phi$ is non-increasing in Theorem 5, and even for least square
loss, its consistency remains open. Therefore, it is a big challenge to find the
necessary and sufficient condition for AUC consistency, and we leave it as an open problem.

In addition, our work could motivate the consistency study on other criteria such as recall,
precision, F1-score, etc.